A Comparative Study of Centroid-Based, Neighborhood-Based and Statistical Approaches for Effective Document Categorization

Authors

  • Vincent Tam
  • Ardi Santoso
  • Rudy Setiono
Abstract

Associating documents to relevant categories is critical for effective document retrieval. Here, we compare the well-known k-Nearest Neighborhood (kNN) algorithm, the centroid-based classifier and the Highest Average Similarity over Retrieved Documents (HASRD) algorithm, for effective document categorization. We use various measures such as the micro and macro F1 values to evaluate their performance on the Reuters-21578 corpus. The empirical results show that kNN performs the best, followed by our adapted HASRD and the centroid-based classifier for common document categories, while the centroid-based classifier and kNN outperform our adapted HASRD for rare document categories. Additionally, our study clearly indicates that each classifier performs optimally only when a suitable term weighting scheme is used. All these significant results lead to many exciting directions for future exploration.
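The abstract's contrast between common and rare categories maps onto the difference between the micro and macro F1 values it mentions. The following is a minimal sketch, not the authors' evaluation code: it assumes a simplified single-label setting, and the function name `micro_macro_f1` and the toy labels are made up for illustration. Micro F1 pools the counts over all documents, so frequent categories dominate it; macro F1 averages the per-category scores, so a rare category counts as much as a common one.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label predictions.

    Hypothetical helper for illustration only; it assumes exactly one
    true label and one predicted label per document.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    # Micro: pool counts across categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


if __name__ == "__main__":
    # Toy data (invented): 8 documents of a common category, 2 of a rare
    # one, and a classifier that always predicts the common category.
    y_true = ["common"] * 8 + ["rare"] * 2
    y_pred = ["common"] * 10
    micro, macro = micro_macro_f1(y_true, y_pred)
    print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
    # prints "micro F1 = 0.800, macro F1 = 0.444"
```

In this toy case the classifier never finds the rare category, which barely dents micro F1 but roughly halves macro F1. That is why a ranking of classifiers can differ between the two measures.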

Similar Resources

A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Document Categorization

Assigning documents to related categories is a critical task for effective document retrieval. Automatic text classification is the process of assigning a new text document to predefined categories based on its content. In this paper, we implemented and compared Naïve Bayes and centroid-based algorithms for effective document categorization of English language text....

Empirical Evaluation of Centroid-based Models for Single-label Text Categorization

Centroid-based models have been used in text categorization because, despite their computational simplicity, they show robust behavior and good performance. In this paper we experimentally evaluate several centroid-based models on single-label text categorization tasks. We also analyze document length normalization and two different term weighting schemes. We show that: (1) Document length nor...

Town trip forecasting based on data mining techniques

In this paper, a data mining approach is proposed for predicting the duration of town trips (travel time) in New York City. To this end, two novel approaches, one mathematical and one statistical, are first proposed for grouping categorical variables with a huge number of levels. The proposed approaches work based on the cost matrix generated by repetitive post-hoc tests f...

Applying the principles governing the traditional neighborhood development (TND) approach in the Islamic Iranian city Study sample: Fahadan neighborhood of Yazd city

The TND approach comprises concepts aimed at raising the quality of life, strengthening and improving the physical space of the neighborhood, increasing social interactions, and improving the sense of place and economic self-reliance. Iranian-Islamic urban planning, like its origins in Islam and Shiite culture, is dynamic and constantly offers new methods to human societies, and is more of a...

A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store this material and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...

Publication date: 2002